1 Objective:

  • Remove data on landowners who own a portion of his farmland in both inside and outside of the buffer of LR side.
  • Aggregate the data to make it to annual water usage data by landowner and conduct CF-analysis.

1.1 Data

# === setup === #
library(here)
source(here("GitControlled/Codes/0_1_ls_packages.R"))
source(here("GitControlled/Codes/0_0_functions.R"))

#/*--------------------------------*/
#' ## Data
#/*--------------------------------*/
# === NRD boundary === #
nrd_boud <- 
 here("Shared/Data/WaterAnalysis/NRD_bd/BND_NaturalResourceDistricts_DNR.shp") %>%
 st_read() %>%
 filter(NRD_Name %in% c("Lower Republican", "Tri-Basin")) %>%
 st_transform(4269)
Reading layer `BND_NaturalResourceDistricts_DNR' from data source `/Users/shunkeikakimoto/Dropbox/ResearchProject/HeterogeneousAllocation/Shared/Data/WaterAnalysis/NRD_bd/BND_NaturalResourceDistricts_DNR.shp' using driver `ESRI Shapefile'
Simple feature collection with 23 features and 6 fields
Geometry type: POLYGON
Dimension:     XY
Bounding box:  xmin: -11583170 ymin: 4865934 xmax: -10609670 ymax: 5312233
Projected CRS: WGS 84 / Pseudo-Mercator
# === well are located in east side of US Highway 183?  === #
east_or_west <- 
  here("Shared/Data/WaterAnalysis/east_or_west.rds") %>%
  readRDS()


# === Data period: 2008-2015  === #
reg_data_all <- 
  here("Shared/Data/WaterAnalysis/comp_reg_dt.rds") %>%
  readRDS() %>%
  .[,owner_name := paste0(firstname, "_", lastname)] %>%
  .[year %in% 2008:2015 & usage <= 45]
  # east_or_west[., on = "wellid"]


# === Data period: 2008-2012 === #
reg_data <- 
  reg_data_all %>%
  .[year %in% 2008:2012]

2 Identify landowners whose farmland locates in both inside and outside of the five-mile buffer

2.1 Identify landowners in LR who have farmlands outside of the buffer

# === All Well Data (Original data)=== #
ir_data <- 
  here('Shared/Data/ir_reg.rds') %>%
  readRDS() %>%
  data.table() %>%
  .[source %in% c('Meter','METER','metered') & nrdname %in% c("Lower Republican", "Tri-Basin")] %>%
  .[,owner_name := paste0(firstname, "_", lastname)]
  # .[year %in% 2008:2012]

# --- check: Low_Tri_5mi --- #
ggplot() + 
  geom_sf(data=nrd_boud) +
  geom_sf(data=st_as_sf(ir_data, coords = c("longdd","latdd"), crs = 4269), 
    aes(color=factor(Low_Tri_5mi)), size=0.5)

  • NOTE:
    • Remember that some farmers own farmlands in both NRDs. We regard those farmlands as if they are owned by two different owners in each NRD.
      • Total number of landowners in the data (before we exclude some farmers from data): 770:
        • only in LR: 325
        • only in TB: 383
        • Both: 62


  • We want to remove only the fields related to farmers who also have farmland outside of the buffer in LR side. If we simply use owners’ name to filter the data, we might also eliminate the farmland that owner owened in TB. So a special care is required to distinguish those owners who owns multiple fields in both LR and TB.
    • We use landowner name associated with NRD name (ex. "Lower Republican ownerA")to filter the data.


  • The below creates a list of landowners who own at least one of their farmlands outside of the buffer in LR.
outside_5mi_LR <- 
  ir_data %>%
  .[,nrd_owner_name := paste0(nrdname, " ", owner_name) ] %>%
  .[nrdname=="Lower Republican" & Low_Tri_5mi==0, nrd_owner_name] %>%
  unique()


  • Then, in the regression data(2008-2012), create a column of binary index both_InOut_5mi_LR showing whether a landowner has farmlands in both inside and outside of the buffer facing LR so that we can distinguish them later.


# create "both_InOut_5mi" index
reg_data <- 
  reg_data %>% 
  .[,nrd_owner_name := paste0(nrdname, " ", owner_name) ] %>%
  .[, both_InOut_5mi_LR := ifelse(nrd_owner_name %in% outside_5mi_LR, 1, 0)]


  • After distinguishing owners who owns multiple fields in both LR and TB, the total number of farmers included in the data turned out to be: 832:
    • LR: 387
    • TB: 445


# Don't need to run
reg_data %>%
  unique(., by="nrd_owner_name") %>%
  .[, .N, by=.(nrdname, both_InOut_5mi_LR) ] %>%
  .[order(nrdname)] %>%
  print()
            nrdname both_InOut_5mi_LR   N
1: Lower Republican                 0 306
2: Lower Republican                 1  81
3:        Tri-Basin                 0 445
  • The above table shows the number of farmers for each NRD. both_InOut_5mi_LR=1 indicates the farmers who also owns farmland outside of the 5-mile buffer of LR side, and the number fo those farmers is 81.
    • Data related to these 81 farmers is removed from the regression data.
    • The total number of farmers included in the data is 306+445=751.


3 New Data (Annual water-use data by owner)

3.1 Flow

  • Use the new criteria both_InOut_5mi_LR to filter the data.
  • For LR data, aggregate the annual data by field to annual data by owner.
  • For TB data, field-level annual data is used. (Aggregation is not necessary)
# Remove owners who have farmlands in both inside and outside of the buffer
mod_reg_data <- 
  reg_data[both_InOut_5mi_LR==0,]

# - Data on TB - #
reg_data_TB <- 
  mod_reg_data[nrdname=="Tri-Basin",]

# - Data on LR - #
reg_data_LR <- 
  mod_reg_data[nrdname=="Lower Republican",]

3.2 Problem

  • Which variable should be used for clustering?
    • I used to use tr (combinations of “township” and “range” index).
  • The value of tr is not one, if a farmer has farmland across several “range” .
# Don't need to run
# Ideally, a single farmers have only one single "tr". Check how many of farmers whose farmland is scattered across multiple "tr"

tr_cl <- 
  reg_data_LR %>%
  distinct(nrd_owner_name, tr) %>%
  .[, index := 1,] %>%
  dcast(.,nrd_owner_name ~ tr, value.var = "index") %>%
  replace(is.na(.), 0) %>%
  .[, total := rowSums(.SD), by = nrd_owner_name]

# tr_cl[total>1, ]
  • 21 farmers in LR have farmlands across several tr segments.


  • We need some new geographic segmentation index which covers larger area.
    • Alternatives:
      • County (This time, I used county.)
      • ?


# Don't need to run
county_cl <- 
  reg_data_LR %>%
  distinct(nrd_owner_name, county) %>%
  .[, index := 1,] %>%
  dcast(.,nrd_owner_name ~ county, value.var = "index") %>%
  replace(is.na(.), 0) %>%
  .[, total := rowSums(.SD), by = nrd_owner_name]

# county_cl[total>1, nrd_owner_name]
county_cl[total>1, ] %>% print()
                            nrd_owner_name FRANKLIN FURNAS HARLAN KEARNEY total
1:            Lower Republican David_Black        1      0      1       0     2
2: Lower Republican John & Ingrid_Tangeman        1      1      0       0     2
3:  Lower Republican Wayne & Linda_Brummer        1      0      1       0     2
4:  Lower Republican _Franklin County Farm        1      0      0       1     2
  • Still there are four farmers who have farmland across several counties.
    • The above table shows whether a farmer has farmland in each of those counties. If farmland located in a county, 1, otherwise 0.
# Don't need to run
ne_county_sf <-  tigris::counties("Nebraska", cb = TRUE, resolution="20m", progress_bar = FALSE) %>%
  dplyr::select(COUNTYFP, NAME) %>%
  filter(NAME %in% str_to_sentence(unique(mod_reg_data$county)))%>%
  st_transform(4269)

county_cl_sf <-
  reg_data_LR %>%
  .[nrd_owner_name %in% county_cl[total>1, nrd_owner_name],] %>%
  distinct(nrd_owner_name, longdd, latdd) %>%
  st_as_sf(., coords = c("longdd","latdd"), crs = 4269)

ggplot()+
  geom_sf(data=ne_county_sf, aes(fill=NAME), alpha=0.6) +
  geom_sf(data=nrd_boud, color="blue", fill=NA) +  
  geom_sf(data=county_cl_sf, size=1)+
  facet_wrap(~nrd_owner_name, ncol=2)

# reg_data_LR %>%
#   .[nrd_owner_name == "Lower Republican John & Ingrid_Tangeman"] %>%
#   distinct(nrd_owner_name, longdd, latdd, .keep_all = TRUE)

# reg_data_LR %>%
#   .[nrd_owner_name == "Lower Republican David_Black"] %>%
#   distinct(nrd_owner_name, longdd, latdd, .keep_all = TRUE)
  • Each panel in the figure above shows how farmland of each of those four farmers is distributed.

    • Interestingly, each of two farm belonging to “Franklin County Farm” locates in TB and LR respectively although they were subjected to the groundwater allocation limit on data.


    • Each of the farmland of “John & Ingrid_Tangeman” and “David_Black” are far apart, respectively. (Cannot be helped)
      • “John & Ingrid_Tangeman” farmland: Furnas county
        • because the irrigated acres in the farmland in Furnas county is much lager the other one
      • “David_Black” farmland: Harlan county
        • because two out of the three farmland is located in Harlan county


    • Regarding the farmland of “Franklin County Farm” and ” Wayne & Linda_Brummer”, it looks okay to regard them as being located in the same county.
      • “Franklin County Farm” farmland : Franklin county
      • “Wayne & Linda_Brummer” farmland : Harlan county
# reg_data_LR$county %>% unique()

reg_data_LR2 <- 
  reg_data_LR %>%
  .[,county_fix := case_when(
    # - first group - #
    nrd_owner_name == "Lower Republican John & Ingrid_Tangeman" ~ "FURNAS",
    nrd_owner_name == "Lower Republican David_Black" ~ "HARLAN",
    # - second group  - #
    nrd_owner_name == "Lower Republican _Franklin County Farm" ~ "FRANKLIN",
    nrd_owner_name == "Lower Republican Wayne & Linda_Brummer" ~ "HARLAN",
    TRUE ~ county
  )]

# county_cl2 <- 
#   reg_data_LR2 %>%
#   distinct(nrd_owner_name, county_fix) %>%
#   .[, index := 1,] %>%
#   dcast(.,nrd_owner_name ~ county_fix, value.var = "index") %>%
#   replace(is.na(.), 0) %>%
#   .[, total := rowSums(.SD), by = nrd_owner_name]

# county_cl2[total>1, ]
  • Now, each farmers in LR have farmland in a single county

3.3 Data aggregation by taking mean values

reg_data_LR_mean <- 
  reg_data_LR2 %>%
    .[,.(
    usage = mean(usage),
    treat2 = mean(treat2),
    # --- soil --- #
    silttotal_r = mean(silttotal_r),
    claytotal_r = mean(claytotal_r),
    slope_r = mean(slope_r),
    ksat_r = mean(ksat_r),
    awc_r = mean(awc_r),
    # --- weather --- #    
    pr_in = mean(pr_in),
    pet_in = mean(pet_in),
    gdd_in = mean(gdd_in),
    # --- tr --- #
    county = unique(county_fix)
    ), by = .(nrd_owner_name, year)] %>%
    .[,county_year := paste0(county, "_", year)]

reg_data_TB_final <-
  reg_data_TB %>%
  .[, names(reg_data_LR_mean) %>% .[-length(.)], with=FALSE] %>%
  .[,county_year := paste0(county, "_", year)]


reg_data_final <- rbind(reg_data_LR_mean, reg_data_TB_final)

# hist(reg_data_LR_mean$usage)

4 CF analysis

4.1 No clustering

  • Important variables:
cov_ls[selected_vars] %>% print()
[1] "silttotal_r" "awc_r"      

4.2 Clustering with county_year

4.2.1 1st CF

vis_cf_raw_cl

4.2.2 2nd CF

vis_cf_res

4.3 Next step

  • Below is the what came up in my mind:
  • The usage is the average amount of water applied to the field (unit is inches).
  • Then, I took a simple average of usage by owner
  • So, I’m taking average of average. This could be a problematic if the sizes of farmlands belonging to the same owner are different
  • For example, suppose owner A has two fields:
    • Field 1: area 200 acre, usage 140 acre-inch, so 0.7 inch
    • Field 2: area 100 acre, water use 50 acre-inch, so 0.2 inch
    • Simple average of usage is \(\frac{0.7+0.2}{2}=0.45\)
    • Using a weighted average, \(\frac{200}{200+100}\times0.7 + \frac{100}{200+100}\times0.5=\frac{140+50}{300}=0.63 > 0.45\)


  • So I needed to calculate with weighted average (weighted by acres).

4.4 Number of owners grouped by number of farmlands owned (in LR)

Landowners categorized by the number
of farmlands owned
num_fields num_farmers
1 162
2 79
3 29
4 12
5 8
6 7
7 3
8 1
9 1
10 3
14 1
  • Total number of landowners in LR is 306
  • Out of this number of landowners, 144 of landowners own more than one farmlands.

5 Aggregate well-level data on LR to owner-level data correctly (~4/28)

5.1 MEMO for myself:

  • usage:=volaf*12/acres
    • volaf: is a measurement of the volume of water used on the farmland (unit: acre-foot)
    • volaf*12: the measurement unit acre-inches
      • 1 acre-foot = 12 acre-inches
    • So, by dividing volaf*12 by acres, it becomes volume of water inches (per acre) (acre-inches by acre)
      • the unit of usage is inches

5.2 Correct way of aggregation by owner

  • an average water usage by owner = (sum of volaf*12)/(sum of acres)


  • The original data (ir_reg.rds) has two columns acres of fields : i.acres and acres.
  • Check which is the correct one:
temp <- reg_data_LR2[1]
temp$usage == temp$volaf*12/temp$i.acres
[1] FALSE
temp$usage == temp$volaf*12/temp$acres
[1] TRUE
  • acres is the correct one.


  • Aggregate data by year and owner in the below.
cov_ls <- c(
  # --- weather --- #
  "pr_in","pet_in",
  # --- soil --- #
  "silttotal_r", "claytotal_r", "slope_r", "ksat_r", "awc_r"
  )

var_ls <- c("usage", "treat2", cov_ls)

# === LR data === #
reg_data_LR_mean <- 
  reg_data_LR2 %>%
    .[,.(
    usage_avg = mean(usage),
    usage = sum(volaf*12)/sum(acres),
    treat2 = mean(treat2),
    # --- soil --- #
    silttotal_r = mean(silttotal_r),
    claytotal_r = mean(claytotal_r),
    slope_r = mean(slope_r),
    ksat_r = mean(ksat_r),
    awc_r = mean(awc_r),
    # --- weather --- #    
    pr_in = mean(pr_in),
    pet_in = mean(pet_in),
    gdd_in = mean(gdd_in),
    # --- tr --- #
    county = unique(county_fix)
    ), by = .(nrd_owner_name, year)] %>%
    .[,county_year := paste0(county, "_", year)]

# === TB data === #
reg_data_TB_final <-
  reg_data_TB %>%
  .[,county_year := paste0(county, "_", year)]

# === Merge LR and TB data === #
reg_data_final2 <- 
  rbind(
    reg_data_LR_mean[, c(var_ls, "county_year"), with=FALSE],
    reg_data_TB_final[, c(var_ls, "county_year"),
     with=FALSE]
     )

5.3 Compare usage_avg = mean(usage) and usage = sum(volaf*12)/sum(acres)

+ Green is sum(volaf*12)/sum(acres) and Blue is mean(usage) + Overall, sum(volaf*12)/sum(acres) is slightly smaller than mean(usage)


5.4 CF analysis

data <- reg_data_final2
Y <- data[, usage]
T <- data[, treat2]
X <- data[, ..cov_ls]
cl_var <- data[, county_year] %>% factor()

#/*--------------------------------*/
#' ## 1st CF
#/*--------------------------------*/
set.seed(23456)

forest_W_cl <- regression_forest(
  X, T, 
  clusters = cl_var,
  num.trees = 4000)
W_hat <- predict(forest_W_cl)$predictions

forest_Y_cl <- regression_forest(
  X, Y, 
  clusters = cl_var,
  num.trees = 4000)
Y_hat <- predict(forest_Y_cl)$predictions

cf_raw_cl <- causal_forest(
  X=X,
  Y=Y,
  W=T, 
  Y.hat = Y_hat, 
  W.hat = W_hat,
  clusters = cl_var,
  tune.parameters = "all",
  num.trees = 4000
)

varimp = variable_importance(cf_raw_cl)
selected_vars = which(varimp > mean (varimp))
# selected_vars = which(varimp/mean(varimp) > 0.2)
se_cov <- data[,cov_ls[selected_vars], with=FALSE]

vis_cf_raw_cl <- gen_impact_viz(
  cf_res= cf_raw_cl,
  data_base=data,
  treat_var='treat2',
  var_ls= cov_ls,
  var_ls_int = cov_ls
)
# vis_cf_raw_cl

#/*--------------------------------*/
#' ## 2nd CF
#/*--------------------------------*/
cf_res <- causal_forest(
  X=se_cov,
  Y=Y,
  W=T, 
  Y.hat = Y_hat, 
  W.hat = W_hat,
  clusters = cl_var,
  # sample.fraction = 0.4,
  num.trees = 4000,
  tune.parameters = "all",
  # tune.parameters = c(""),
)

#/*--------------------------------*/
#' ## Visualization
#/*--------------------------------*/
vis_cf_res <-
  gen_impact_viz(
  cf_res= cf_res,
  data_base=data,
  treat_var='treat2',
  var_ls= cov_ls[selected_vars],
  var_ls_int = cov_ls[selected_vars]
)
# vis_cf_res

5.4.1 1st CF

vis_cf_raw_cl

5.4.2 2nd CF

vis_cf_res